Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix various bugs in split/apply/combine in 0.21 release #2280

Merged
merged 9 commits into from
Jun 23, 2020

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Jun 10, 2020

In this PR I fix some corner cases that were bugs in 0.21 release. It should be merged before #2279 as it should be backported.

What it covers:

  • documentation improvements of what happens (and should happen in corner cases, especially when something is "empty", where something is: GroupedDataFrame, DataFrame, list of transformations, or groupcols)
  • fix a bug with GroupDataFrame indexing that caused StackOverflow
  • fix a bug in creation of GroupedDataFrame with ungroup=false (wrong columns could be shown as grouping columns in the past)
  • fix cases of "empty" stuff passed (listed above), especially when keepkeys=false or ungroup=false (there are several corner cases here); most likely in real usage scenarios these cases do not happen, but still we should provide a correct result here.

In general correctly handling all these corner cases is tricky so probably if someone would be willing to review it it might be challenging (so thank you in advance for the efforts). I will also try to improve test coverage for this if I find other "corner cases" still not covered.

@bkamins bkamins added bug priority grouping non-breaking The proposed change is not breaking backport labels Jun 10, 2020
@bkamins bkamins added this to the 1.0 milestone Jun 10, 2020
@bkamins bkamins requested a review from nalimilan June 17, 2020 16:00
# in this case we are sure that the result GroupedDataFrame has the
# same structure as the source
# we do not copy data as it should be safe - we never mutate fields of gd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do we mutate fields now?

Copy link
Member Author

@bkamins bkamins Jun 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do not mutate fields now.
It is a preparation for filter!. But I prefer to make a copy here as it is not very expensive, and be safe that in the future changing if someone introduces mutation to GroupedDataFrame (even if we skip filter! for now as I assume we will do) it does not get forgotten (as it is easy to forget that this line in such a large codebase assumes that it is not mutated).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather avoid making copies preventively as long as we never mutate fields. Especially if we drop filter! for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - changed. I have made an extensive implementation note, as GroupedDataFrame is currently quite tricky to implement new functionality for.

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member Author

bkamins commented Jun 19, 2020

Code that works with a nonzero number of groups is likely to try to access new columns and therefore fail later.

This is what I was afraid of exactly. I will open a separate issue or PR to track this.

bkamins and others added 2 commits June 19, 2020 17:50
@bkamins
Copy link
Member Author

bkamins commented Jun 22, 2020

coverage only fails.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two suggestions.

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins bkamins merged commit e607175 into JuliaData:master Jun 23, 2020
@bkamins bkamins deleted the fix_select_combine_empty branch June 23, 2020 17:07
@bkamins
Copy link
Member Author

bkamins commented Jun 23, 2020

Thank you!

@@ -34,15 +50,19 @@ end
function Base.getproperty(gd::GroupedDataFrame, f::Symbol)
if f in (:idx, :starts, :ends)
# Group indices are computed lazily the first time they are accessed
Threads.lock(gd.lazy_lock)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this again, shouldn't this line be moved inside if getfield(gd, f) === nothing? Ideally threads wouldn't have to lock each other when accessing indices once they have been computed. To avoid threads from computing the fields multiple times (which would be wasteful, though probably not problematic), we could check getfield(gd, f) === nothing again after calling lock.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - I was also thinking about it. The solution is exactly as you say:

we could check getfield(gd, f) === nothing again after calling lock.

I just wanted to have a simpler implementation as lock/unlock is cheap:

julia> function f(l)
       lock(l)
       unlock(l)
       end
f (generic function with 1 method)

julia> l = Threads.ReentrantLock()
ReentrantLock(nothing, Base.GenericCondition{Base.Threads.SpinLock}(Base.InvasiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), 0)

julia> f(l)

julia> @benchmark f($l)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     26.066 ns (0.00% GC)
  median time:      26.087 ns (0.00% GC)
  mean time:        27.893 ns (0.00% GC)
  maximum time:     77.285 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     993

and with threads often things are more tricky than they seem.

I will open a PR with the change you propose and we can discuss there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants